SARS-CoV-2 Viral Load in the Nasopharynx at Time of First Infection Among Unvaccinated Individuals

Key Points
Question: What factors are associated with SARS-CoV-2 viral load at the time of COVID-19 diagnosis, and is viral load associated with disease severity?
Findings: In this secondary cross-protocol analysis of 1667 placebo recipients from 4 harmonized, randomized, phase 3 COVID-19 vaccine efficacy trials, no associations were found between viral load and any of the measured covariates or disease severity.
Meaning: The findings of this study suggest that caution should be exercised in the use of individual-level viral load in comparisons across trials and/or settings and as a surrogate for COVID-19 severity, especially given increasing diversity in preexisting immunity.

In this analysis, we used the same harmonized COVID-19 comorbidity and SARS-CoV-2 exposure risk definitions as Theodore et al. 1 For convenience, these definitions are reproduced (with author permission) below:

Baseline Comorbidities
In this analysis, comorbid conditions (yes or no) were defined as the presence of a given medical condition indicated in either the medical history eCRF or the comorbid questionnaire CRF data. Comorbid conditions listed by CDC as associated with severe COVID-19 were first mapped to MedDRA coding. Specifically, on Feb 15, 2022, the CDC updated its listing of underlying medical conditions associated with higher risk for severe COVID-19 (https://www.cdc.gov/coronavirus/2019-ncov/science/sciencebriefs/underlying-evidence-table.html). All of the medical conditions in this CDC list were mapped to MedDRA 24.0-English version coding for the applicable Preferred Term (PT) and High-level Term (HLT) for each medical condition. The medical diagnoses listed in the medical history eCRF of each participant were available as MedDRA terms. The frequencies of occurrence of each of these conditions were tabulated and included as an independent variable. The medical conditions and subcategories with a frequency of occurrence <5% were combined or eliminated for the construction of variables analyzed in the final dataset (Supplementary Table 4 from Theodore et al.). The mapped coding was then used to determine the presence of each condition as listed in the medical history eCRF.

Occupational Risk
Occupational risk was determined by attributing Occupational Safety and Health Administration (OSHA) hazard recognition scores to self-reported workplace information provided by participants. OSHA functions as a regulatory agency under the United States Department of Labor to ensure safe and healthful working conditions, and as such, defined categories in response to the COVID-19 pandemic to aid in the assessment and mitigation of exposure risk in the workplace. Low exposure risk jobs have minimal contact with the public or coworkers. Medium exposure risk jobs have frequent or sustained close contact with the public or coworkers in outdoor or well-ventilated settings. High exposure risk jobs include close or poorly ventilated working conditions with known or suspected sources of SARS-CoV-2 (such as a hospital, grocery store, or public transit). Very high exposure risk jobs involve performing specific medical, postmortem, or laboratory procedures. If individuals selected more than one category, the maximum score was taken. High and very high exposure risk categories were combined for analysis.
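The scoring rule described above (take the maximum selected category, then combine the two highest categories) can be sketched in Python. This is a minimal illustration rather than the trial code: the numeric coding of the OSHA categories from 0 to 3 and the function name `occupational_risk` are assumptions made for this example.

```python
# Hypothetical ordinal coding of the OSHA exposure-risk categories (assumed here).
OSHA_SCORE = {"low": 0, "medium": 1, "high": 2, "very high": 3}


def occupational_risk(selected_categories):
    """Return the analysis category for one participant.

    Takes the maximum score over all selected categories, then
    combines "high" and "very high" into a single level.
    """
    if not selected_categories:
        return None  # no workplace information reported
    top = max(selected_categories, key=OSHA_SCORE.__getitem__)
    return "high/very high" if OSHA_SCORE[top] >= 2 else top


print(occupational_risk(["low", "very high"]))  # high/very high
print(occupational_risk(["medium", "low"]))     # medium
```

Returning `None` for participants with no reported workplace information is a choice of this sketch; the source does not state how such cases were handled.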

Living Situation Risk
Living situation risk synthesizes variables across all four trials and is scored on a scale of low, medium, high, or very high risk. It is based on housing type for the Moderna trial and number of co-habitants for the other 3 trials. For the AstraZeneca, Janssen, and Novavax trials, low, medium, high, and very high risk conditions corresponded to 0-1, 2, 3, and 4 or more co-habitants, respectively. For the Moderna study, individuals self-reported the housing type(s) that applied. Each housing type was assigned to low (participants specified as not having risk of exposure related to housing), medium (single-family or detached housing, housing without shared entrances or elevators), high (congregate settings such as dormitory, group housing, or high density such as apartments with shared entrances or elevators), or very high (nursing homes, long-term care facilities, shelters, and multi-family dwellings) risk categories. If participants selected more than one housing type, the highest risk score was taken. If "other" was selected, a value was imputed using the most frequent category within a given study.
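The living situation rules above can be expressed as three small functions. The sketch below is illustrative only: the function names are hypothetical, and representing an "other" response as `None` is an assumption of this example.

```python
from collections import Counter

# Ordered from lowest to highest risk.
RISK_LEVELS = ["low", "medium", "high", "very high"]


def cohabitant_risk(n_cohabitants):
    """Risk level from number of co-habitants (AstraZeneca, Janssen, Novavax)."""
    if n_cohabitants <= 1:
        return "low"
    if n_cohabitants == 2:
        return "medium"
    if n_cohabitants == 3:
        return "high"
    return "very high"


def housing_risk(levels_selected):
    """Highest risk level among the housing types a participant selected (Moderna)."""
    return max(levels_selected, key=RISK_LEVELS.index)


def impute_other(levels_in_study):
    """Replace None (standing in for "other") with the most frequent observed level."""
    observed = [lvl for lvl in levels_in_study if lvl is not None]
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if lvl is None else lvl for lvl in levels_in_study]
```

For example, `housing_risk(["medium", "high"])` returns `"high"`, and `impute_other` would be applied separately within each study so that the imputed value reflects that study's most frequent category.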

Protocol-specific schedule of illness visits, specimen types, and viral load quantitation
Although COVID-19 endpoint definitions were harmonized across the four trials, the schedule of illness visits and specimen types varied by parent protocol. Moderna collected a nasal/nasopharyngeal (NP) swab on illness visit 1, followed by saliva on illness visit days 3, 5, 7, 9, 14, 21, and 28. AstraZeneca collected nasal/NP swabs on illness visit days 1, 14, 21, and 28; and saliva on days 1, 3, 5, 8, 11, 14, 21, and 28. Janssen collected nasal/NP swabs on illness visit 1, and every other illness visit day through day 23. Novavax collected nasal/NP swabs only from illness visit days 1, 2, and 3 (eFigure 1).
eFigure 1. Summary of protocol-defined sampling collection for illness visits, triggered by symptom onset. Nasal and nasopharyngeal (NP) swabs are denoted by a head with swab, and saliva specimens are denoted by a mouth with a tube.
Specimens collected from participants enrolled in the Moderna trial were quantified by Eurofins Viracor. 2 The RT-PCR assay targets two genes (N1 and N2) in a single channel, and conversion to a standardized viral load has been previously described. 3 Specimens collected from participants of the AstraZeneca trial were quantified at LabCorp using an RT-PCR assay targeting the E and ORF1ab genes separately. University of Washington Virology (UWVL) quantified specimens from both Janssen and Novavax using the Abbott m-2000 SARS-CoV-2 real-time RT-PCR, which targets the N1 and N2 genes in a single channel. 4
For specimens from the Janssen trial, swabs were first tested at a local lab, and remnants of locally positive swabs were sent to University of Washington for quantification and sequencing (if selected). Because of the international nature of this trial, not all study sites had access to suitable local PCR testing capacity. If local testing was not available, specimens were shipped to Covance/Labcorp for qualitative PCR testing, and presumptive positive specimens were then sent to University of Washington for quantification. As a result, swabs that were not tested locally underwent an additional freeze-thaw cycle before quantitation at University of Washington.

Urchin: A Tool for Predicting SARS-CoV-2 Variants from Spike Sequences
The gold-standard approach to determine the lineage (and hence WHO variant status) of a SARS-CoV-2 sequence is with either the PANGOLIN or NextClade software tools. Because these tools provide information about the specific lineage of a given virus, they require a whole-genome sequence as input. These studies focused on the spike protein, and as such, sequences for some studies, for various reasons, were only available as the S gene (nucleotide sequence) or the spike protein (protein sequence). We needed a way to determine the WHO-defined variant label for these spike-only sequences.
We accomplished this with a predictive model, which we implemented as a Shiny-based 5 web tool named Urchin. Urchin uses an optimized learner for predicting the WHO Greek-lettered variant name of a given SARS-CoV-2 spike protein sequence. In this particular analysis, Urchin's predictions were used to determine the variant labels for sequences from the Moderna (n = 790) and AstraZeneca (n = 680) studies.
To train and validate the Urchin model, we started with a corpus of approximately 1.93 million sequences obtained from the GISAID database 6 on August 21, 2021 (EPI-SET doi 10.55876/gis8.230822ky). We later updated our data set to account for emergence of the Omicron variants, specifically 10,000 sequences of both BA.1 and BA.2 along with the 239 BA.3 sequences available at the date of retrieval (March 20, 2022). These sequences were aligned using the MAFFT multiple sequence alignment program, 7 with manual touch-ups as needed. From here, we generated a feature set by rendering the sequence data into a set of binary indicator variables for each position, leading to a feature for each observed amino acid at every site (e.g., "position 614 is 'G'"), excluding gaps ("-") and unknown amino acids ("X"). As a dimensionality reduction step, features that were within 20 instances of being perfectly homogeneous were screened out. To ensure that signature sites characteristic of variant definitions would not be accidentally filtered out, we identified a set of signature sites for all extant variants and prespecified them to be included as training features, regardless of whether they would pass the dimensionality reduction filter. These signature sites are enumerated in eTable 2. This dimensionality reduction step reduced the derivation set from 17,488 features down to 9,467 features. We then conducted a 70:30 training:validation split of the data, using random selection and stratifying by variant. This resulted in a training set of 1,348,494 sequences and a validation set of 577,927 sequences.
eTable 2. Variant-specific signature sites that were included in the homogeneity screening process (all positions indexed to the NC_045512 reference strain). 8
To optimize our model and eliminate any bias that might occur from variants that were over-represented in the database, we created an exploration set from the training set by randomly sampling N sequences for each variant, where N is equal to 38, the frequency of the least-represented variant in the training data (Theta). Since the training data contained sequences for 15 variants (including the Ancestral lineage), this resulted in an exploration set containing 570 sequences. This exploration set was then used as the training set for an initial learner.
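The feature construction, homogeneity screen, and balanced exploration set described above can be sketched as follows. This is a simplified illustration and not the Urchin source code: the function names are hypothetical, and the screening rule (keep an indicator only if both its presence count and its absence count are at least 20) is one reading of "within 20 instances of being perfectly homogeneous" that is assumed here.

```python
import random
from collections import Counter, defaultdict


def indicator_features(aligned_seqs):
    """Map each aligned sequence to its set of (position, residue) indicators.

    Positions are 1-based alignment columns; gaps ("-") and unknown
    residues ("X") do not produce a feature.
    """
    return [
        {(pos, aa) for pos, aa in enumerate(seq, start=1) if aa not in "-X"}
        for seq in aligned_seqs
    ]


def screen_features(feature_sets, min_minority=20, keep=frozenset()):
    """Drop near-constant indicators, always keeping prespecified ones.

    An indicator is retained when both its 1-count and its 0-count are at
    least `min_minority` (assumed interpretation of the screening rule).
    Indicators listed in `keep` (e.g. signature sites) bypass the filter.
    """
    n = len(feature_sets)
    counts = Counter(f for fs in feature_sets for f in fs)
    kept = {f for f, c in counts.items() if min(c, n - c) >= min_minority}
    return kept | (set(keep) & set(counts))


def balanced_sample(labels, per_class, seed=0):
    """Indices of `per_class` randomly chosen sequences for every variant label."""
    by_label = defaultdict(list)
    for i, lab in enumerate(labels):
        by_label[lab].append(i)
    rng = random.Random(seed)
    return sorted(i for idx in by_label.values() for i in rng.sample(idx, per_class))
```

With 15 variant labels and `per_class=38`, `balanced_sample` would return 15 x 38 = 570 indices, matching the size of the exploration set reported above.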

[eTable 2 column headings: WHO Label, Spike]
We approached this as a multi-class problem, to develop a learner that would predict the Greek-lettered variant of any sequence, as provided by GISAID's metadata file. All sequences from the A.1 Ancestral lineage and the B.1 basal outbreak lineage (with the D614G mutation) were grouped into a single "Ancestral Strain" variant category. For our learning method, due to the discrete nature of mutations and their associations with variants, we selected extreme gradient boosting (XGBoost 9) with the multiclass log loss evaluation metric and 50 rounds of boosting iterations.
We performed 5-fold cross-validation to estimate the error with the exploration set. This resulted in a cross-validated predictive accuracy (the proportion of correct predictions) of 0.988 (95% CI: 0.975, 0.995). Using the same hyperparameters, we trained a single model to predict the variant of the validation set, resulting in a predictive accuracy of 0.993 (0.9927, 0.9931). Only 79 of the total 9,467 features held predictive importance in this model and thus were used to define our new feature set. These final features (79 residues across 60 unique positions) are enumerated in eTable 3. Validating this model on the holdout set of 577,927 sequences, it performed with a predictive accuracy of 0.9929 (0.9927, 0.9931). This is the final model that was selected for use with the Urchin web tool.
The results across both studies then underwent a phylogenetic analysis in order to investigate any potential miscalls from Urchin. Both sequence sets were combined and phylogenetic trees were generated using the PhyML software with the Blosum62 model. From these trees, 20 sequences predicted to be of the Ancestral Strain by Urchin appeared to be miscalls, and this was confirmed with a tree built from a larger set of sequences. The 20 miscalls were confirmed to be: 9 Lambda, 5 Iota, 4 Beta, and 2 which were either Gamma or Zeta. Of these 20 miscalls, 6 were from the Moderna study (0.7% miscalls) and 14 from AstraZeneca (2% miscalls). The miscalls were due to either missing sequence content, rare mutations, or, in the case of Lambda, a 13-AA deletion covering spike positions 64 through 76 that is inconsistently found in Lambda and was confusing our learner by defeating the characteristic G75V mutation.
This predictor is available online as the Urchin web tool, which can be accessed at https://urchin.fredhutch.org/. Users can submit a set of spike sequence data in FASTA format and Urchin will return the predicted WHO variant labels. The sequences do not need to be aligned, and they are each transformed into the 79 features required by the learner. The final model uses these features to predict the variant label for each submitted sequence. The results are reported back to the user and can be downloaded as a CSV file. Urchin is available to the public for the sake of reproducibility; because it has not been retrained for recent variants (particularly emerging Omicron subvariants and recombinants), it will be of limited use with modern sequences. The source code is available at https://github.com/jamesprg/urchin.

Multiple imputation of SARS-CoV-2 Variants
While all protocols prioritized obtaining a sequence from each infection observed in the trial, successful sequencing was not always possible. There are several reasons why it was not possible to obtain a viral sequence, even from repeated sampling over the course of an infection. In general, sequencing was not even attempted if the specimen was found to have a low viral load (or high cycle threshold), although the specific sequencing thresholds vary by lab. Sequencing could also be missing for other reasons, including low sample volume or poor specimen quality. As a result, sequencing was not available for 20.6% of the analysis cohort, and these infections were classified as missing a variant for the primary analyses.
The extensive genomic data that were collected, sequenced, and openly shared from specimens worldwide through GISAID provided an opportunity for imputation analyses to fill in those missing variants. At the specimen level, the information is highly variable in the GISAID metadata; as such, we focused on the strongest predictors of circulating viruses: space and time.
We used GISAID metadata of specimens from individuals in the 8 countries represented in our analysis population: Argentina, Brazil, Chile, Colombia, Mexico, Peru, South Africa, and the United States of America. The specimens in our analysis cohort were classified as one of the WHO-named variants (Alpha, Beta, Delta, Gamma, etc.) or Other. Since this analysis was limited to the early part of the pandemic, the Other category was primarily composed of Ancestral lineages, but had other minor lineages that did not reach the level of a named variant. We excluded specimens with impossible variant-date combinations, such as Delta in November of 2020.
For imputation analyses, we supplemented the analysis data with the estimated proportion of circulating variants at the time of infection, based on the local GISAID metadata. In particular, for each infection in the analysis dataset, we estimated the proportion of cases attributed to each variant by summarizing a two-week window around the date of COVID-19 onset in local GISAID data. For swabs collected from participants within the United States, state-level GISAID data was used, while country-level GISAID data was used for all other infections. eFigure 2 provides a visual representation of the available data used to estimate the distribution of circulating variants by infection in Chile. In the example, we would attribute all of the circulating cases to the Other category for the first infection (left-most red vertical line); in contrast, for the last infection missing a variant (the rightmost vertical line), the estimated distribution of variants at that time would include nearly all of the observed variants (except Delta), to varying degrees.
eFigure 2. Example of data used for imputation analyses. Gray dots represent the observed variants sequenced in Chile over time in the GISAID database, blue dots indicate samples from the analysis dataset that have successfully been sequenced, and the red dashed lines denote the dates of illness visit day 1 swabs that are missing sequencing information. Other includes Ancestral strain, as well as all other non-WHO-named lineages.
We used these estimated proportions of circulating variants at the time of infection to do a simple imputation analysis. Specifically, we imputed the missing variant by taking a random draw from the multinomial distribution, with variant probabilities defined by estimated proportions of circulating variants at the time of infection, and fitting the multivariate regression model (Figure 4). Regression results over 20 imputed datasets were then combined and summarized to account for the additional variability using the mice package in R. 10 Global nominal p-values were defined as the median of the multiple p-values obtained from Wald tests for each imputed dataset only when standard approaches failed. 11
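The imputation draw described above can be illustrated with a short Python sketch. It is not the analysis code, which used R: the record layout `(region, collection_date, variant)`, the function names, and the choice to center the two-week window as onset plus or minus 7 days are all assumptions of this example.

```python
import random
from collections import Counter
from datetime import date, timedelta


def circulating_proportions(records, region, onset, half_window_days=7):
    """Variant proportions among reference records near `onset` in `region`.

    `records` is an iterable of (region, collection_date, variant) tuples.
    The window is taken as onset +/- 7 days, which is an assumption about
    how the two-week window is centered.
    """
    lo = onset - timedelta(days=half_window_days)
    hi = onset + timedelta(days=half_window_days)
    counts = Counter(v for r, d, v in records if r == region and lo <= d <= hi)
    total = sum(counts.values())
    return {v: c / total for v, c in counts.items()} if total else {}


def impute_variant(proportions, rng):
    """Draw one variant label with probability equal to its proportion.

    A single categorical draw is a multinomial draw with one trial.
    """
    if not proportions:
        raise ValueError("no reference records in the window")
    variants = sorted(proportions)
    weights = [proportions[v] for v in variants]
    return rng.choices(variants, weights=weights, k=1)[0]
```

Repeating the draw with 20 different seeds for every infection that is missing a variant would give 20 completed datasets, each of which is then analyzed with the regression model before the results are pooled.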

Sensitivity and Exploratory analyses
We considered several sensitivity and exploratory analyses. To examine the robustness of our conclusions to the variant imputation analysis, we repeated the multivariate model on the subset of participants with successful sequencing. Two analyses explored the sensitivity of our conclusions to the variant identification: one restricted the multivariate analysis to the subset of those identified as being infected by the Ancestral variant only, and the other used Hamming distances in lieu of variant calls in the subset of participants with successful sequencing. To explore the sensitivity of our conclusions to the inclusion of negative PCR results, we repeated the multivariate analysis restricting to those with detectable viral load. The sensitivity analysis restricted to those enrolled in the US examined the impact of including international trial sites, primarily from the Janssen trial, on the primary conclusions. The sensitivity analysis limited to those enrolled in the Janssen trial explored the sensitivity of our conclusions within the largest trial. As an exploratory analysis, the multivariate model was fit including country-specific smoothed calendar time trends, to allow for potential confounding of viral load by local epidemic dynamics. In particular, the generalized additive model (GAM) is an extension of the multivariate linear regression used in the primary analysis that flexibly models non-linear local temporal trends in COVID-19 incidence using cubic regression splines.

Severity prediction
In a post-hoc exploratory analysis, we examined the utility of log viral load at diagnosis in the prediction of severe COVID-19. This was only feasible for the Janssen trial, where two measures of viral load were considered: the log viral load at diagnosis (i.e., first illness-associated swab) and the area under the longitudinal log viral load curve (VL-AUC), estimated using the trapezoidal rule over 28 days. Additional predictors included a full set of baseline participant characteristics and infection characteristics, summarized below:

Infection characteristics
Infecting variant, initial VL, AUC of VL trajectory (VL-AUC), days since onset VL measurements began
Superlearner modeling, using the negative log-likelihood loss function and a library of adaptive and non-adaptive learners and classifiers, was employed. Cross-validation was performed at two levels: a 5-fold outer level to compute the cross-validated area under the ROC curve (CV-AUC), and a 5-fold inner level to estimate ensemble weights. CV-AUC and influence curve-based confidence intervals were computed for the ensemble model (Superlearner), discrete Superlearner, and the individual learners. 12 Marginal and conditional variable importance were assessed using the vimp package in R. 13 For predictive models that included a single covariate (either log VL at first illness-associated PCR test, or VL-AUC), learner libraries included glm and gam (SL.mean, SL.glm, SL.gam). Predictive models adjusting for additional baseline covariates used a larger collection of learner libraries that also included glm interactions (SL.glm.interaction), elastic net (SL.glmnet; alpha = 0, 0.25, 0.5, 0.75, 1), random forests (SL.ranger), and gradient-boosted machines (SL.xgboost).
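The VL-AUC predictor listed among the infection characteristics is a trapezoidal-rule area under the longitudinal log viral load curve. A minimal Python sketch of that calculation is given below; the function name `vl_auc` is hypothetical, and ignoring measurements after day 28 without extrapolating to the horizon is an assumption of this example, since the source does not describe how the curve endpoints were handled.

```python
def vl_auc(days, log_vl, horizon=28.0):
    """Trapezoidal area under a log viral load curve up to `horizon` days.

    `days` are measurement times (days since the first illness visit) in
    increasing order and `log_vl` the matching log10 viral loads. Points
    after `horizon` are ignored; no extrapolation is done (an assumption).
    """
    pts = [(t, v) for t, v in zip(days, log_vl) if t <= horizon]
    return sum(
        (t1 - t0) * (v0 + v1) / 2.0
        for (t0, v0), (t1, v1) in zip(pts, pts[1:])
    )


# Three swabs on days 0, 2, and 4 with log10 viral loads 6, 4, and 2:
# area = 2 * (6 + 4) / 2 + 2 * (4 + 2) / 2 = 16
print(vl_auc([0, 2, 4], [6, 4, 2]))  # 16.0
```

A participant with a single measurement contributes an area of zero under this definition, which is one reason the handling of sparse trajectories would need to be specified in a real analysis.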

Summary of univariate model results
Viral load at diagnosis was highly variable, with a median viral load of 6.18 log10 copies/mL (interquartile range 4.66-7.12 log10 copies/mL). Among the three protocols that provided data that included undetectable viral loads, 6.3% (68/1073) of participants met this criterion. Distributions of log10 viral load and univariate analyses of factors associated with viral load at diagnosis are summarized in Figure 3 and eTable 2.
Parent protocol was univariately associated with viral load (adjusted p<0.01), with placebo cases in Janssen having an estimated 0.83 log10 copies/mL lower mean viral load relative to those in the reference Moderna protocol (95% CI: 1.04 to 0.62 lower) (Figure 3A, eTable 2). Additionally, country was associated with viral load: Colombia and South Africa had significantly lower mean viral loads compared to the reference US. Given that 95% of non-US cases were enrolled in the Janssen trial, these associations between country and protocol and viral load are likely related.
Other baseline factors univariately associated with viral load at diagnosis included participant race, having one or more comorbidities, and SARS-CoV-2 exposure risk (eTable 2). Similar viral load distributions were observed among participants who had severe disease vs. those with non-severe disease (Figure 3B).
Additionally, there were apparent differences in viral load among the SARS-CoV-2 variants (adjusted p<0.01). Viral loads corresponding to infections missing sequences were lower than those with sequences, with median viral loads of 3.62 and 6.48 log10 copies/mL, respectively (Figure 3C). This may be due to an inherent threshold for successful sequencing. The univariate analysis estimated Beta, Gamma, and Mu to have between 0.5 and 1.2 log10 copies/mL lower and Delta to have 0.28 log10 copies/mL higher mean viral load at diagnosis relative to Ancestral. However, there were just seven Delta infections in this cohort. In these univariate analyses, infecting variant explained approximately 23% of the variability in log10 viral load, although this was primarily attributable to the difference in viral load between individuals with and without sequences.
Thus, while several factors were found to be associated with viral load at diagnosis based on univariate analyses, none of the participant characteristics, beyond infecting variant, explained more than 3.7% of the observed variability. As a brief sensitivity analysis, we compared three univariate analyses of infecting variant (eTable 3). Overall, all three approaches provided similar estimates of mean difference in VL relative to the Ancestral variant. Estimates from imputation are comparable to those obtained in both the complete case and observed data analyses. All confidence intervals overlap, although there are some small differences in individual variant estimates. The most notable difference in estimates was seen among Gamma infections, where the complete case analysis estimated the difference in mean VL at 0.49 log10 copies/mL lower than Ancestral (95% CI: 0.82 to 0.17 lower) and the imputation analysis estimated Gamma infections to be 0.22 log10 copies/mL lower compared to Ancestral (95% CI: 0.63 lower to 0.20 higher).
eTable 3. Comparison of model results from three univariate analyses of infecting variant. The observed data approach, which classifies missing variants as such; the complete case analysis, which limits the univariate analysis to the subset with successful sequencing; and the multiple imputation univariate analysis, which imputes missing variants based on the observed distribution of circulating variants near the swab collection, are compared. Multiple imputation results combine 20 imputations.

Multivariate model using Hamming distance
As an exploratory analysis, we also used the Hamming distance of spike sequences from placebo infections to the Ancestral SARS-CoV-2 strain as an alternative to the variant. For each analysis, we summarize the cross-validated area under the ROC curve estimates and 95% confidence intervals for each of the candidate algorithms, in addition to the discrete and continuous SuperLearner.
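The Hamming distance used in this exploratory analysis counts the positions at which an aligned spike sequence differs from the Ancestral reference. A minimal Python sketch follows; the function name is hypothetical, and counting gaps and unknown residues as mismatches is an assumption of this example, because the source does not say how they were treated.

```python
def hamming_distance(seq, reference):
    """Number of aligned positions at which two sequences differ.

    Both sequences must already be aligned to the same length. Gaps and
    unknown residues are compared like any other character (an assumption).
    """
    if len(seq) != len(reference):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(seq, reference))


print(hamming_distance("MFGFL", "MFVFL"))  # 1
```

Unlike a variant label, this gives a single numeric covariate per infection, which is why it can stand in for the categorical variant term in the regression model.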
In eFigure 9, the estimated marginal feature importance for all available covariates was plotted in descending order, and neither viral load measurement broke the top 10 most important variables in predicting severe COVID-19. Participant characteristics, including age, race, country, and various comorbidities, were ranked as higher importance than both viral load measurements, which is consistent with the literature. 14 Variant was among the top-ranked variables marginally, although none of these features were found to be statistically significant.

eTable 3. The final set of features (79 features across 60 unique positions) used by the final predictive model (all positions indexed to the NC_045512 reference strain).
With this refined set of features, we trained a final model on the full training set, using the same XGBoost parameters as before.

eFigure 1. eFigure 2.
Scatter plot of viral load at diagnosis (log10 copies/mL) by Hamming distance of spike sequence to Ancestral SARS-CoV-2, for those placebo infections with successful sequencing.

Summary of complete case analysis (multivariate model among those with variant data, no imputation)
Estimated mean differences in SARS-CoV-2 viral load in nasal/NP swab at COVID-19 diagnosis, among those with successful variant calls (N = 1,323; adjusted R² = 0.106). Forest plot illustrating estimated mean difference in log10 copies/mL SARS-CoV-2 viral load between groups defined by participant or COVID-19 disease characteristics, based on multivariate regression analysis. 95% confidence intervals and Holm-adjusted p-values are provided. Days since COVID-19 onset is defined as the number of calendar days between protocol-defined onset of COVID-19 and the specimen collection corresponding to diagnosis.

eFigure 3. Estimated mean differences in SARS-CoV-2 viral load in swabs at COVID-19 diagnosis on the subset of participants with quantifiable viral load at diagnosis, imputing missing variants (N = 1,599; adjusted R² = 0.044). Forest plot illustrating estimated mean difference in log10 copies/mL SARS-CoV-2 viral load between groups defined by participant or COVID-19 disease characteristics, based on multivariate regression analysis with imputed variants. 95% confidence intervals and Holm-adjusted p-values are provided.
eFigure 4a. Summary of parametric results from GAM analysis with country-specific temporal regression splines (N = 1,667; adjusted R² = 0.081). Estimated mean difference in log10 viral load (VL) at diagnosis between each covariate category and the reference category, or per unit increase in the covariate in the case of continuous covariates, along with 95% confidence interval and nominal p-value, based on univariate linear regression models. Median global nominal p-values from Wald Test to test coefficients for categorical variables with >2 levels. Adjusted p-values account for multiplicity and are corrected using the Holm method. Days since COVID-19 onset is defined as the number of calendar days between protocol-defined onset of COVID-19 and the specimen collection corresponding to diagnosis.
eFigure 4b. Estimated country-level temporal smoothers. Ticks at bottom of each panel indicate contributing cases.

Subgroup Sensitivity Analyses
eFigure 5. Multivariate linear regression of log10 viral load at diagnosis on the subset of participants infected with the Ancestral variant only (N = 867; adjusted R² = 0.104). Estimated mean difference in viral load at diagnosis (log10 copies/mL) between each covariate category and the reference category, or per unit increase in the covariate in the case of continuous covariates, along with 95% confidence interval and nominal p-value, based on univariate linear regression models. Global nominal p-values from Wald Test to test coefficients for categorical variables with >2 levels. Adjusted p-values account for multiplicity and are corrected using the Holm method.
eFigure 6. Estimated mean differences in SARS-CoV-2 viral load in swabs at COVID-19 diagnosis on the subset of participants living in the US, imputing missing variants (N = 995; adjusted R² = 0.047). Forest plot illustrating estimated mean difference in log10 copies/mL SARS-CoV-2 viral load between groups defined by participant or COVID-19 disease characteristics, based on multivariate regression analysis with imputed variants. 95% confidence intervals and Holm-adjusted p-values are provided. Median p-values over multiple imputations are reported.
eFigure 7. Estimated mean differences in SARS-CoV-2 viral load in swabs at COVID-19 diagnosis on the subset of placebo participants enrolled in the Janssen trial, imputing missing variants (N = 916; adjusted R² = 0.025). Forest plot illustrating estimated mean difference in log10 copies/mL SARS-CoV-2 viral load between groups defined by participant or COVID-19 disease characteristics, based on multivariate regression analysis with imputed variants. 95% confidence intervals and Holm-adjusted p-values are provided.

Predictors of Severe COVID-19
Viral load measurements alone were found to be poor predictors of severe COVID-19. Viral load at diagnosis had a CV-AUC of 0.52 (95% CI: 0.47 to 0.57) and the area under the viral load curve (VL-AUC) had a CV-AUC of 0.49 (95% CI: 0.42 to 0.57). The predictive performance of the ensemble model, discrete Superlearner, and individual learners is summarized in eFigure 8.
eFigure 8. Summary of the utility of log viral load at diagnosis (A) and the area under the VL trajectory curve (B) in predicting severe COVID-19.

eFigure 9. Marginal variable importance measures for all available baseline characteristics and covariates collected around the time of infection.
Prediction of severe COVID-19 disease was improved with the inclusion of additional baseline characteristics, with a CV-AUC of 0.71 (95% CI: 0.67 to 0.75; eFigure 10).
eFigure 10. Summary of multivariate predictors of COVID-19 severe disease. Cross-validated area under the ROC curve estimates and 95% confidence intervals for each of the candidate algorithms, in addition to the discrete and continuous SuperLearner.
When adjusting for other baseline and infection characteristics, log viral load at diagnosis (IV1) had the highest-ranked conditional variable importance measure (eFigure 11). This suggests that after accounting for other known predictors, log viral load at diagnosis (IV1) does improve the prediction of severe COVID-19. However, the wide 95% confidence intervals suggest that the improvements in prediction with log viral load measurements are modest at best.
eFigure 11. Estimated conditional variable importance measures and 95% CI for the features in the prediction of severe COVID-19.

eTable 2. Univariate linear regression results for placebo infections.
Estimated mean difference in log10 viral load (VL) at diagnosis between each covariate category and the reference category, or per unit increase in the covariate in the case of continuous covariates, along with 95% confidence interval and nominal p-value, based on univariate linear regression models. Global nominal p-values from Wald Test to test coefficients for categorical variables with >2 levels. Adjusted p-values account for multiplicity and are corrected using the Holm method. Adjusted R² values are included from the univariate model fit.
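For readers unfamiliar with the Holm method named in this caption, the adjustment can be sketched in a few lines of Python. This illustrates the standard step-down procedure and is not the code used to produce the table; the function name `holm_adjust` is hypothetical.

```python
def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    # Indices of the p-values from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The k-th smallest p-value is multiplied by (m - k + 1), capped at 1.
        candidate = min(1.0, (m - rank) * p_values[i])
        # Adjusted values must not decrease as the raw p-values increase.
        running_max = max(running_max, candidate)
        adjusted[i] = running_max
    return adjusted
```

For three nominal p-values 0.01, 0.04, and 0.03, the adjusted values are approximately 0.03, 0.06, and 0.06: the smallest is multiplied by 3, the next by 2, and the largest inherits the running maximum.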